Method and system of ranking and clustering for document indexing and retrieval

ABSTRACT

A relevancy ranking and clustering method and system that determines the relevance of a document relative to a user's query using a similarity comparison process. Input queries are parsed into one or more query predicate structures using an ontological parser. The ontological parser parses a set of known documents to generate one or more document predicate structures. A comparison of each query predicate structure with each document predicate structure is performed to determine a matching degree, represented by a real number. A multilevel modifier strategy is implemented to assign different relevance values to the different parts of each predicate structure match to calculate the predicate structure's matching degree. The relevance of a document to a user's query is determined by calculating a similarity coefficient, based on the structures of each pair of query predicates and document predicates. Documents are autonomously clustered using a self-organizing neural network that provides a coordinate system that makes judgments in a non-subjective fashion.

FIELD OF THE INVENTION

[0001] The relevancy ranking and clustering method and system for document indexing and retrieval of the present invention is intended to provide mechanisms for an information retrieval system to rank documents based on relevance to a query and in accordance with user feedback. A user can make queries in the form of natural language, keywords or predicates. Queries are converted into ontology-based predicate structures and compared against documents, which have been previously parsed for their predicates, to obtain the best possible matching documents, which are then presented to the user. The present method and system is designed to automate judgments about which documents are the best possible matches to a query within a given index. The system is further designed to allow users to provide feedback in order to fine-tune the automated judgment procedure.

BACKGROUND OF THE INVENTION

[0002] As the volume of information available on the Internet increases, the need for methods to search, filter, and manage such information is increasing. Text categorization has become an important component in many information search and retrieval systems. Conventional search and retrieval systems commonly classified information into several predefined categories. For example, Yahoo!'s topic hierarchy provides a complex tree of directories to help users locate information on the Internet. In many search engine companies, trained text editors manually categorize information. Such manual classification of information is not only a very time-consuming and costly process, but is also plagued by inconsistencies and oversight problems. To overcome such problems, automated methods for categorizing text are becoming more common.

[0003] U.S. Pat. No. 5,418,948 to Turtle discloses an information retrieval system, which stems all input words (as well as removing stopwords), and matches the resulting queries against a table of known phrases in order to convert phrasal inputs into a standardized format. A Bayesian inference network ranks the results, where each document is associated with a set of probabilities for all of the words within the document. These probabilities are calculated with respect to a hierarchy of document-organization categories. Retrieval may be accomplished through two techniques, which can result in different rankings for the same collection of documents.

[0004] The first technique used by Turtle is document-based scanning, where each document is evaluated according to the probabilities of each of its attributes in order to determine the probability that it answers the query. After a sufficient number of documents are retrieved, while scanning continues through the collection, documents are only evaluated through a subset of their attributes. This means that after some critical number of documents is reached, documents which are unlikely to rank higher than the lowest-ranking document in the set are not added to the list of results.

[0005] The second technique involves so-called concept-based scanning, where all documents containing a query concept (including its synonyms as defined in a thesaurus) are evaluated according to the probability of the query attribute within the document. This means that only a few attributes are examined for each document, but they are the same for all documents. As with document-based scanning, documents are no longer added to the result set when a critical number is reached and the probability of a new document outranking any documents currently within the set is extremely low. The stopping criteria are not identical, and the interpretation of the same attribute probabilities may lead to different rankings for the same documents, even when matched by the same query. In both cases, scoring is calculated by averaging the probabilities for all of the attributes in the query (adding the individual probabilities, then dividing by the number of concepts).

[0006] Turtle's system is deficient in several respects. First, by scoring documents according to the relative occurrences of terms within the index, highly relevant documents with low-probability concepts may be missed. Second, periodic recalculation of attribute probabilities is a necessary performance penalty, if new documents will change the probability distribution of the attribute set. Third, the thesaurus-based approach treats synonyms as equally valuable terms in a query. This may expand the result set such that the stopping criteria described above end up filtering out documents containing the exact match in favor of documents containing a higher proportion of synonymous words. It is not clear that this is a desirable result from the standpoint of an end-user who is particularly interested in the exact word used for the query. Turtle's system does not take grammatical structure into account; in fact, it does not take adjacency information into account, since each document is treated as a “bag of words,” with no preservation of order information.

[0007] U.S. Pat. No. 4,270,182 to Asija discloses a system for asking free-form, un-preprogrammed, narrative questions. The system of Asija accepts unstructured text from multiple sources and divides the text into logical information units. These logical information units may be sentences, paragraphs, or entire documents; each logical information unit is assigned a unique identification number, and is returned as a whole when it is selected for retrieval. The retrieval system of Asija uses standard keyword-based lookup techniques.

[0008] The procedure of Asija only applies to the logical information units, which are ranked as equally relevant at the end of a preceding stage. Both synonyms and searchonyms are considered as equivalent to query words found within the logical information units. The net effect of the ranking and filtering process of Asija is to order documents by maximizing the number of query words matched, followed by the number of instances of query words. Furthermore, the Asija system does not take grammatical structure into account. In addition, synonyms are not exact matches for queries, and thus should be ranked lower. The Asija system also only makes use of literal text strings, as all synonyms must be specified by dictionary files that list text strings as equivalent.

[0009] A key feature of the present invention is the unique and novel method of representing text in the form of numerical vectors. The vectorization techniques of the present invention offer several advantages over other attempts to represent text in terms of numerical vectors. First, the numbers used are ontologically generated concept representations, with meaningful numerical relationships such that closely related concepts have numerically similar representations while more independent concepts have numerically dissimilar representations. Second, the concepts are represented in the numerical form as part of complete predicate structures, ontological units that form meaningful conceptual units, rather than simple independent words. Third, the vectorization method and system described herein provides a way to represent both large portions of long documents and brief queries with vector representations that have the same dimensionality. This permits rapid, efficient relevancy ranking and clustering by comparing the query vectors with substantial portions of documents, on the order of a page or more at a time, with no loss of accuracy or precision. Furthermore, it permits comparisons of large-scale patterns of concepts across entire documents rather than the small moving windows used in prior systems. These advantages provide the present method and system with unique performance and accuracy improvements over conventional systems.

SUMMARY OF THE INVENTION

[0010] The basic premise of relevancy ranking and clustering is that a set of documents is sorted or ranked according to certain criteria, and clustered to group similar documents together in a logical, autonomous manner.

[0011] The relevancy ranking and clustering method and system of the present invention scores documents by word meaning and logical form in order to determine their relevance to the user's query. It also compares patterns of concepts found within documents to the concepts within the query to determine the most relevant documents to that query.

[0012] As part of the relevancy ranking and clustering method and system, documents and user queries are first parsed into ontological predicate structure forms, and those predicate structures are used to produce a novel numerical vector representation of the original text sources. The resulting vector representations of documents and queries are used by the relevancy ranking unit and the document clustering component of the present invention to perform the ranking and clustering operations described herein. The unique and novel method of producing the vector representations of documents and queries provides efficiency, accuracy, and precision to the overall operation of the relevancy ranking and clustering method and system.

[0013] Input queries and documents are parsed into one or more predicate structures using an ontological parser. An ontological parser parses a set of known documents to generate one or more document predicate structures. Those predicate structures are then used to generate vector representations of the documents and queries for later use by the ranking and clustering method and system.

[0014] The ranking and clustering method and system performs a comparison of each query predicate structure with each document predicate structure, and of document vectors to query vectors, to determine a matching degree, represented by a real number. A multilevel modifier strategy is implemented to assign different relevance values to the different parts of each predicate structure match to calculate the predicate structure's matching degree.

[0015] When many documents have a high similarity coefficient, the clustering process of the relevancy ranking and clustering method provides a separate, autonomous process of identifying documents most likely to satisfy the user's original query by considering conceptual patterns throughout each document, as opposed to individual concepts on a one-by-one basis.

[0016] The relevancy ranking and clustering method and system of the present invention provides a fine-grained level of detail for semantic comparisons, due to the fact that conceptual distance can be measured and weighted absolutely for all terms within a query. In addition, the relevancy ranking and clustering method and system of the present invention provides a sophisticated system for ranking by syntactic similarity, because syntactic evaluation occurs on lists of predicate arguments. The manner in which the arguments are derived is irrelevant, and can be accomplished through any syntactic parsing technique. This provides a more general-purpose ranking system.

[0017] The relevancy ranking and clustering method and system of the present invention ranks according to grammatical structure, not mere word adjacency. Thus, a passive sentence with more words than an equivalent active sentence would not cause relevant documents to be weighted lower. The relevancy ranking and clustering method and system of the present invention makes use of actual word meaning, and allows the user to control ranking based on word similarity, not just presence or absence.

[0018] The relevancy ranking and clustering method and system of the present invention also ranks according to conceptual co-occurrence rather than simple word or synonym co-occurrence as is found in other systems. This provides the advantage of recognizing that related concepts are frequently found near each other within bodies of text. The relevancy ranking and clustering method and system furthermore considers overall patterns of concepts throughout large documents and determines matches to query concepts no matter where in the document the matches occur.

[0019] The relevancy ranking and clustering method and system additionally provides a simple and efficient means of recognizing that the frequency of occurrence of concepts within a document often corresponds to the relative importance of those concepts within the document. The present invention autonomously recognizes frequently occurring query concepts located within a document and judges such documents as more relevant than documents in which the query concepts occur only rarely. The autonomous and efficient method of accomplishing this is based both on the unique vectorization techniques described herein and on the operation of the ranking and clustering method and system.

[0020] The relevancy ranking and clustering method and system of the present invention allows the user to specify whether documents similar to a particular document should be ranked higher or lower, and automatically re-ranks such documents based on a neural network. The neural network provides a coordinate system for making such judgments in an autonomous and non-subjective fashion, which does not require trial-and-error efforts from the user. Finally, there is no code generation or recompilation involved in the present system, which only performs the needed document clustering once; requests for similar or different information return a different segment of the result set, but without recomputing the relations between all the documents, as is required in a spreadsheet-like or other statistical approach.

[0021] The relevancy ranking and clustering method and system of the present invention uses grammatical relationship information to adjust ranking relations. Although words do not need to be grammatically related to each other within a document to include the document in the result set, such relationships serve to adjust the rankings of otherwise similar documents into a non-random, logical hierarchy. Furthermore, each word within the present system is a meaningful entity with mathematically meaningful distances relative to other concepts. Thus, synonyms are not treated as probabilistically equal entities, but are assigned lower rankings depending on how far they are from the exact query word given by the user.

[0022] Documents containing sentences that logically relate query terms are ranked higher than documents which simply contain instances of those terms. Similarly, thesaurus-like query expansion is made possible through the use of ontologies, and the present ranking system enables similar concepts to be graded according to the degree of their similarity. This capability represents a significant innovation over other purely statistical techniques.

[0023] In addition to giving higher weights to documents where search terms occur in close proximity, the relevancy ranking method and system of the present invention is able to make further discrimination by whether or not the search terms are bound together in a single predicate within a document. Additionally, the relevancy ranking and clustering method and system of the present invention is capable of discriminating between documents based on conceptual similarity so that conceptually similar, but inexact, matches receive lower weights than exactly matched documents.

[0024] In the relevancy ranking and clustering method and system of the present invention, the vector representations of individual documents and user queries are based not on individual words but on patterns of conceptual predicate structures. Dynamic alteration can be made to the content of the document sets, thus allowing the relevancy ranking and clustering method and system to begin its processing even before the search for potential matching documents is complete.

[0025] As a result, the relevancy ranking and clustering method and system of the present invention provides an automatic process to cluster documents according to conceptual meanings. The present system is designed to make fine discriminations in result ranking based on the degree of conceptual similarity between words, i.e., exactly matched words result in higher rankings than synonyms, which in turn result in higher rankings than parent concepts, which in turn result in higher rankings than unrelated concepts.

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] These and other attributes of the present invention will be described with respect to the following drawings in which:

[0027] FIG. 1 is a block diagram illustrating a relevancy ranking component according to the present invention;

[0028] FIG. 2 is a block diagram illustrating a relevancy ranking method performed by the relevancy ranking component shown in FIG. 1, according to the present invention;

[0029] FIG. 3 is Table 1 showing examples of modifier names and weights according to the present invention;

[0030] FIG. 4 is a flow chart illustrating the predicate structure matching function according to the present invention;

[0031] FIG. 5 is a flow chart illustrating the predicate matching process according to the present invention;

[0032] FIG. 6 is a flow chart illustrating concept matching according to the present invention;

[0033] FIG. 7 is a flow chart illustrating proper noun matching according to the present invention;

[0034] FIG. 8 is a flow chart illustrating argument matching according to the present invention;

[0035] FIG. 9 is a diagram illustrating an individual neurode according to the present invention;

[0036] FIG. 10 is a diagram illustrating a general neural network according to the present invention;

[0037] FIG. 11 is a block diagram illustrating the document clustering component according to the present invention;

[0038] FIG. 12 is a document clustering component feature map according to the present invention;

[0039] FIG. 13 is a block diagram illustrating the relevancy ranking component followed by the document clustering component according to the present invention;

[0040] FIG. 14 is a block diagram illustrating the relevancy ranking component according to the present invention;

[0041] FIG. 15 is a block diagram illustrating the document clustering component according to the present invention; and

[0042] FIG. 16 is a block diagram illustrating the document clustering component followed by the relevancy ranking component according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0043] The relevancy ranking and clustering method and system of the present invention is intended, as one example application, to work with the concept-based indexing and search system set forth in co-pending patent application Ser. No. 09/627,295, which indexes collections of documents with ontology-based predicate structures through automated and/or human-assisted methods, and which is incorporated herein by reference. The relevancy ranking and clustering method and system can also be used with other document indexing and information retrieval systems, including question-answering systems, as described below.

[0044] In one example application of the present invention as embodied as part of a concept-based indexing and search system, a user can make queries in the form of natural language, keywords, or predicates. Queries are converted into ontology-based predicate structures, if necessary, and compared against documents, which have been previously parsed for their ontology-based predicates, to obtain the best possible matching documents, which are then presented to the user.

[0045] The transformation of natural language sentences into predicate structures is performed by an ontological parser, as set forth in co-pending patent application Ser. No. 09/697,676, and incorporated herein by reference.

[0046] Predicate structures are representations of logical relationships between the words in a sentence. Every predicate structure contains a predicate, which consists of either a verb or a preposition, and a set of arguments, each of which may be any part of speech. The ontological parser converts a series of sentences first into a collection of parse trees, and then into a collection of completed predicate structures.

[0047] Ontologies are hierarchies of related concepts, usually represented by tree structures. The ontology-based parser for natural language processing application set forth in co-pending patent application Ser. No. 09/697,676 introduces a new implementation of this tree structure. The proposed data structure consists of an integer value, where each digit of the integer corresponds to a specific branch taken at the corresponding level in the tree.
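
The digit-per-level encoding described above can be illustrated with a minimal sketch. The function names, the use of decimal digits, the fixed depth of six levels, the 1 to 9 branch numbering, and the zero padding are all assumptions made for illustration; the co-pending application defines the actual representation.

```python
def encode_path(branches, depth=6):
    """Pack a root-to-node path into one integer, one decimal digit per level.

    Hypothetical scheme: branch indices run 1-9 and unused deeper levels
    are padded with 0 so that every key has the same number of digits.
    """
    if len(branches) > depth:
        raise ValueError("path is deeper than the fixed key length")
    if any(not 1 <= b <= 9 for b in branches):
        raise ValueError("each branch index must fit in one digit (1-9)")
    key = 0
    for digit in list(branches) + [0] * (depth - len(branches)):
        key = key * 10 + digit
    return key


def decode_path(key, depth=6):
    """Recover the branch list from a key produced by encode_path."""
    digits = [int(c) for c in str(key).zfill(depth)]
    while digits and digits[-1] == 0:  # drop the zero padding
        digits.pop()
    return digits
```

Under this assumed scheme, two concepts that share a long path prefix differ only in their low-order digits, which is what lets numerical nearness stand in for conceptual nearness.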

[0048] FIG. 1 illustrates a high-level block diagram of the method of ranking the similarity between an input query 118 and a set of documents 120 according to the present invention as embodied in the example system described above.

[0049] The relevancy ranking and clustering method and system consists of three major units that work together to provide the required ranking and clustering services. The block diagram in FIG. 1 illustrates how the units combine to rank and cluster documents in response to a user query in an example application of the present system.

[0050] The first unit of the present system, the vectorization unit, transforms the representations of documents and queries into a vector form. In order for the parsed documents and queries to be used in the document clustering component and in portions of the relevancy ranking unit, the documents and queries must be represented as multidimensional numerical vectors. These vector representations provide the means of efficient comparison between the documents and the query. An individual document may be represented by a single document vector, or it may be represented by a collection of document vectors, depending on the parameters of the vector representation process and the length of the document. Queries are typically very brief, usually containing only one, or possibly a few, predicate structures. Documents, on the other hand, typically contain tens or even hundreds of predicate structures. Despite this disparity in the amount of information to be converted to vector representation, it is essential that both query vectors and document vectors have exactly the same number of vector elements in order for comparisons between the two to be valid.

[0051] The vectorization unit of the present system provides an innovative and unique process for: (a) converting the text of documents and queries into multidimensional numerical vectors; (b) ensuring that those vector representations are appropriately normalized for use in document-to-query comparisons; and (c) guaranteeing that no matter how long the document and how short the query, all such vector representations have exactly the same dimensionality. This vectorization process thus provides the basis for the further comparison of documents and queries through processes as typified by the document clustering component and the relevancy ranking unit. However, the vectorization method described here can also support other methods of comparisons that have similar vector representation requirements.

[0052] The vectorization unit has two configurations, a document vectorization unit 130 and a query vectorization unit 134. Each of these configurations converts ontologically parsed text into vector representations. The document vectorization unit 130 converts the set of predicate structures derived from ontologically parsing a document into one or more large-dimensioned numerical vectors. The query vectorization unit 134 performs the same task for an ontologically parsed user query. The resulting vectors are saved along with the predicate structures they were derived from in the corresponding document predicate storage unit 124 and query predicate storage unit 126. These vectors are used by the remaining two pieces of the present system, the relevancy ranking unit 128 and the document clustering component 140.

[0053] The second piece of the present system, the relevancy ranking unit 128, provides concept-by-concept comparison of individual predicate structures to a user's query. It also provides a coarse-grained relevancy estimate by comparing an overall pattern of predicate structures in the document to the user's query.

[0054] The third piece, the document clustering component 140, does fine-grained discrimination when many documents appear to be close matches to the user's query. This portion of the present method and system identifies groupings of such matching documents and provides a feedback method to rapidly identify which grouping or groupings are most applicable to the user query. Each of these three major units of the relevancy ranking and clustering method and system will be discussed in detail below.

[0055] The document vectors are generated by the document vectorization unit 130, illustrated in FIG. 2. This process receives the complete set of predicate structures produced as a result of the ontological parse of the document, and combines them to produce one or more numerical vectors that represent the pattern of predicate structures within the document. A similar query vectorization process is performed by a query vectorization unit 134, also shown in FIG. 2. These document vectors and query vectors are used both by the vector matching component of the relevancy ranking unit and by the document clustering component. The details of the vectorization process, which are similar for both documents and queries, are explained below.

[0056] The first step in vectorization of documents occurs when the ontological parser 122 originally processes the documents and their corresponding predicate structures are stored in the document predicate libraries, as illustrated by document vectorization unit block 130 in FIG. 1 and FIG. 2. One or more vector representations based on those predicate structures are generated by the document vectorization unit and stored for future access. The same document vectors are used both by the document clustering component and by the vector relevancy matching component of the relevancy ranking unit.

[0057] The predicate structures in the document predicate library are each identified by a predicate key. The predicate key is a fixed-length integer representation of the predicate, in which conceptual nearness corresponds to numerical nearness. Thus, a simple subtraction operation yields a rough estimate of the relative conceptual nearness of the predicates. For example, the integer representations for “give” and “donate” are conceptually nearer than the representations for “give” and “purchase,” and thus, the difference between the integer representations of “give” and “donate” is smaller than the difference between the integer representations of “give” and “purchase.”
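
The subtraction-based estimate can be sketched as follows. The key values in the table are invented purely for illustration and are not taken from any actual ontology; only their relative ordering reflects the "give", "donate", "purchase" example in the text.

```python
# Hypothetical fixed-length predicate keys; the numbers are made up.
PREDICATE_KEYS = {
    "give": 312100,
    "donate": 312200,
    "purchase": 315000,
}


def conceptual_distance(first, second, keys=PREDICATE_KEYS):
    """Rough conceptual distance: absolute difference of the integer keys."""
    return abs(keys[first] - keys[second])


def nearest(word, candidates, keys=PREDICATE_KEYS):
    """Return the candidate whose key is numerically closest to word's key."""
    return min(candidates, key=lambda c: conceptual_distance(word, c, keys))
```

With these assumed keys, `nearest("give", ["donate", "purchase"])` selects "donate", mirroring the comparison described in the paragraph.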

[0058] One or more multi-dimensional vectors are constructed for each document 120 using the integer representations of the predicate structures identified within the document 120 to perform vector relevancy matching for each document 120, in block 132 of FIG. 2. Because of the need for uniformity in vector relevancy matching results, a fixed number of predicate structures, M, are used. Typically, these represent the first M predicate structures in the document 120. Either only the predicate or, optionally, the predicate plus a fixed number of arguments from the predicate structure may be included from each predicate structure.

[0059] In the case where a fixed number of arguments are used, any predicate structure with fewer than that number of arguments uses zeros in the remaining unfilled argument positions. Any predicate structure with more than the specified number of arguments fills the argument positions until argument positions are all filled; remaining arguments are ignored and omitted from the vector representation. Thus, if exactly two arguments are to be included from each predicate structure, a predicate structure with only one argument would insert a 0 for the second position. A predicate structure with three arguments would include the first two and omit the third.
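
The zero-fill and truncation rule can be sketched in a few lines. The function name and the representation of arguments as a list of integer keys are assumptions made for this example.

```python
def fix_argument_count(arg_keys, num_args):
    """Return exactly num_args argument keys.

    Extra arguments beyond num_args are dropped; missing positions are
    filled with 0, following the rule described in the text.
    """
    kept = list(arg_keys[:num_args])
    return kept + [0] * (num_args - len(kept))
```

For instance, with two argument positions, a one-argument structure `[7]` becomes `[7, 0]` and a three-argument structure `[7, 8, 9]` becomes `[7, 8]`.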

[0060] The result of this process is an N-dimensional vector representation of that document 120, where N=M if only the predicate portion of the predicate structures are used, and N=q*M if a fixed number of arguments (q-1) from the predicate structures are included. This is a design choice or system parameter. In most practical applications, N will typically be a moderately large number, on the order of 50 to 100, although there is no conceptual limit on its size. However, there may be some performance degradation as N grows larger.
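
As a hedged illustration of the dimensionality rule, the sketch below flattens the first M predicate structures into a single list of N = q*M integers, where q-1 is the number of arguments kept per structure. The tuple layout `(predicate_key, [argument_keys])` and the function name are assumptions, and because the text does not say how a document with fewer than M predicate structures is handled, the sketch simply rejects that case.

```python
def document_vector(predicate_structures, m, num_args=0):
    """Flatten the first m predicate structures into one vector.

    predicate_structures: list of (predicate_key, [argument_keys]) tuples
    (a hypothetical layout). The result has (num_args + 1) * m elements.
    """
    if len(predicate_structures) < m:
        # Assumption: behavior for short documents is not specified.
        raise ValueError("document has fewer than m predicate structures")
    vector = []
    for predicate_key, arg_keys in predicate_structures[:m]:
        vector.append(predicate_key)
        kept = list(arg_keys[:num_args])          # truncate extra arguments
        vector.extend(kept + [0] * (num_args - len(kept)))  # zero-fill
    return vector
```

With `num_args=0` the vector has N = M elements (predicates only); with `num_args=2` it has N = 3*M elements.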

[0061] In another embodiment, to perform vector relevancy matching of longer documents, a document 120 may be represented by multiple vectors instead of only one. Thus, the first N predicates, optionally including a fixed number of arguments for each predicate, are used for the first vector representation, the next N predicates and arguments are used for the next vector representation, and so on, until there are fewer than N unused predicates and arguments remaining. For the final vector, the last N predicates and arguments are used, even if there is some overlap between this vector and the immediately previous vector.
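
The multi-vector embodiment amounts to splitting a sequence into consecutive fixed-size windows, with a final window taken from the end of the sequence so that it may overlap its predecessor. A minimal sketch follows; the function name and the generic `items` list are assumptions, and inputs shorter than one window are rejected because the text does not cover that case.

```python
def split_into_windows(items, n):
    """Split items into consecutive windows of exactly n elements.

    If the length is not a multiple of n, the last window is the final
    n elements, overlapping the previous window as described in the text.
    """
    if n <= 0 or len(items) < n:
        raise ValueError("need at least one full window of n items")
    windows = [items[i:i + n] for i in range(0, len(items) - n + 1, n)]
    if len(items) % n != 0:
        windows.append(items[-n:])
    return windows
```

For ten items and n = 4 this yields windows covering positions 0-3, 4-7, and 6-9, so the last two windows share two items.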

[0062] Once the vectors that represent a document 120 are composed, each must be normalized to a fixed vector length. Normalization can be performed using any of a variety of well-known and established techniques. It is not necessary that normalization be performed to a unit-length vector; any constant length is acceptable. A typical simple normalization technique to a unit-length vector is to divide each vector element by the length of the original vector, as illustrated in the following equations: $\begin{matrix}{W = \left( w_{1}, w_{2}, w_{3}, w_{4}, \ldots, w_{n} \right)} \\ {\left\| W \right\| = \sqrt{w_{1}^{2} + w_{2}^{2} + w_{3}^{2} + w_{4}^{2} + \ldots + w_{n}^{2}}} \\ {W_{norm} = \left( \frac{w_{1}}{\left\| W \right\|}, \frac{w_{2}}{\left\| W \right\|}, \frac{w_{3}}{\left\| W \right\|}, \frac{w_{4}}{\left\| W \right\|}, \ldots, \frac{w_{n}}{\left\| W \right\|} \right)}\end{matrix}$

[0063] This is only one example of normalization. As those familiar with the art are aware, other well-understood possibilities can be used. The result of the normalization operation is a set of one or more normalized document vectors that represent the pattern of concepts identified within each document.
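
The Euclidean normalization given in the equations can be sketched as follows. The `target_length` parameter is an assumption added to reflect the statement that any constant length is acceptable, and the zero-vector check is a safeguard not mentioned in the text.

```python
import math


def normalize(vector, target_length=1.0):
    """Scale vector so that its Euclidean length equals target_length."""
    norm = math.sqrt(sum(w * w for w in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [target_length * w / norm for w in vector]
```

For example, `normalize([3, 4])` divides each element by the original length 5, giving a vector of length 1.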

[0064] The normalized document vectors are typically stored along with the predicate structures for that document at the time the documents are originally parsed by the ontological parser. As a result, efficiency is increased since the vectors do not have to be repeatedly constructed with each access to the document.

[0065] In addition to the document contents, vectorization must be performed on the user query by the query vectorization unit 134. However, this can present a problem because a user query will typically consist of only one or possibly a few conceptual predicate structures compared to the many predicate structures found in a typical document. As a result, while the normalized document vectors may reflect perhaps 50 or 100 predicate structures, the user query may have as few as a single predicate structure to work with.

[0066] In this case, the query predicate structure or structures arerepeated enough times to make up the total number of elements needed toconstruct a vector of exactly the same dimensionality as the normalizeddocument vectors. Like the document vectors, the query vector must benormalized; this is done using the same normalization process used innormalizing the document vectors. The final result is a normalized queryvector.

[0067] If the query has more than one predicate structure, multiplequery vectors can be constructed with the individual query predicatestructures in various orders. Thus, if the query consists of twopredicate structures, A and B, two query vectors can be constructed, onewith the predicate structures ordered as

[0068] (A, B, A, B, . . . , A, B)

[0069] and one with the predicate structures ordered as

[0070] (B, A, B, A, . . . , B, A)

[0071] The vector matching unit and the document clustering component can then operate on each of these query vectors in turn.
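The repetition and reordering of query predicate structures can be sketched as below. The sketch assumes that each predicate structure is already reduced to a single integer code, and it generates one vector per rotation of the query; the text says only "various orders", so the choice of rotations is an assumption, as is the name `build_query_vectors`. Normalization would still be applied to each result afterwards.

```python
from typing import List, Sequence


def build_query_vectors(query_predicates: Sequence[int], dim: int) -> List[List[int]]:
    """Repeat the query predicates to fill dim entries, once per rotation."""
    if not query_predicates:
        raise ValueError("the query must contain at least one predicate")
    k = len(query_predicates)
    vectors = []
    for start in range(k):
        # Cycle through the predicates, beginning at a different one each time.
        vectors.append([query_predicates[(start + i) % k] for i in range(dim)])
    return vectors
```

For a two-predicate query this yields exactly the (A, B, A, B, ...) and (B, A, B, A, ...) orderings shown above.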

[0072] Relevancy ranking, the second major component of the relevancy ranking and clustering method and system, is a process that produces a set of documents sorted, or ranked, according to certain criteria. The relevancy ranking process, according to the present invention, uses a similarity comparison algorithm to determine the relevance of a document to a query. One or more query predicate structures are generated using an ontological parser to parse input queries. One or more document predicate structures similarly are generated using an ontological parser to parse the set of known documents. Each query predicate structure is compared with each document predicate structure to determine a matching degree, represented by a real number. A multilevel modifier strategy is used to assign different relevance values to the different parts of each predicate structure match to calculate the matching degree of each predicate structure. The relevance of a document to a user's query is determined by calculating the similarity coefficient, based on the structures of each pair of query predicates and document predicates.

[0073] The relevancy ranking unit comprises multiple components that perform different levels of similarity comparison. The first component is a predicate vector matching unit that dynamically compares the coarse-grained overall pattern of predicate structures for each document to those of the user query and returns a ranking by predicate pattern similarity. The second component is a predicate structure matching unit that compares two predicate structures and returns a similarity measure. The third component is a predicate matching unit that compares the similarity between the predicate parts of two predicate structures and returns a similarity measure. The fourth component is an argument matching unit that compares the argument parts of two predicate structures and returns a similarity measure. The fifth component is a concept matching unit that compares two concepts and returns a similarity measure. Finally, the sixth component is a proper noun matching unit that compares two proper nouns and returns a similarity measure.

[0074] The relevancy ranking unit considers a set of factors that impact the ranking algorithm, and implements a multiple-level modifier strategy to adjust the weight of each factor.

[0075] There are six steps in the relevancy ranking method of the present invention. First, a group of candidate documents 120 is sent to the ontological parser 122. Second, the ontological parser 122 parses each document 120 and generates one or more predicate structures for each sentence in the document. The set of predicate structures from a document 120, along with one or more document vectors produced by the document vectorization unit 130, are stored in a document predicate storage component or library 124. The document predicate library contains a formal representation of the document 120 by storing the set of predicate structures representing the sentences in the document 120. The document predicate library for each candidate document 120 is stored in a document predicate storage component 124 and can be retrieved by a primary key. Third, an input query 118 is sent to the ontological parser 122. Fourth, the ontological parser 122 parses the input query 118 and generates one or more predicate structures. All predicate structures from the input query 118 are represented in a query predicate library that is stored in a query predicate storage component 126. Fifth, a query predicate library (representing an input query 118) and a set of document predicate libraries (representing a set of documents 120) are sent to the relevancy ranking component 128 to compare the similarity level between an input query 118 and the documents 120. Documents 120 are then ranked in the order of their similarity levels. Sixth, the documents 120 are returned in ranked order.

[0076] The vector relevancy matching component of the relevancy ranking unit provides an efficient, computationally concise ranking of the documents based on the predicate structures from those stored for the documents 120. This matching technique uses primarily the predicate portions of the stored information, and thus does not do fine-scale ranking. It makes use of the vectors produced by the document vectorization unit 130 and the query vectorization unit 134 to perform its ranking operations.

[0077] The operation of the vector matching unit is illustrated in FIG. 2. It comprises the steps of constructing a vector of the user query 134, retrieving the vectors representing the documents 130 that need to be ranked from the document predicate library, performing a dot-product operation between the user query vector and each of the document vectors 132, ranking the documents in order of the dot-product result, from largest value (most relevant) to smallest value (least relevant), and returning the rankings.

[0078] The predicate vector matching implemented by the relevancy matching component has the following inputs:

[0079] Query_predicateLibrary: a query predicate library structure representing an input query and containing all predicate structures generated by the ontological parser from parsing the input query, and

[0080] Doc_predicateLibrary: a document predicate library structure containing the set of predicate structures representing the sentences in the natural language source document.

[0081] Query predicates are converted into a queryvector. For each document inside the Doc_predicateLibrary, the corresponding documentvector(s) are retrieved. The dot product of the queryvector and the documentvector is computed. The matching_score is made equal to the dotProduct. The document is inserted into the ranking list, with the highest dotProduct values first, and the lowest dotProduct values last.

[0082] The output of the predicate vector matching algorithm is a ranked list of documents with the closest-fit documents at the top of the list and the worst-fit documents at the bottom of the list.
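The predicate vector matching steps can be sketched as follows. The names `dot` and `rank_documents` and the dictionary keyed by a document identifier are illustrative assumptions. The text does not say how several vectors belonging to one document are combined, so taking the best dot product per document is also an assumption.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(
    query_vector: Sequence[float],
    doc_vectors: Dict[str, List[Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Return (document id, matching_score) pairs, highest score first."""
    scores = []
    for doc_id, vectors in doc_vectors.items():
        # Assumption: a document with several vectors is scored by its best one.
        matching_score = max(dot(query_vector, v) for v in vectors)
        scores.append((doc_id, matching_score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```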

[0083] For more fine-grained relevancy ranking, other techniques used by the relevancy ranking component are used to perform one-to-one comparisons of the predicate structures within the documents to the predicate structures within the query. These other techniques are explained next.

[0084] Since both input queries and documents are converted into one or more predicate structures, the similarity between an input query and a document depends on the similarity between query predicate structures and the document predicate structures. There are different strategies for matching two predicate structures when they are similar but do not exactly match.

[0085] As described previously, a predicate structure consists of a predicate, which is either a verb or a preposition, and a set of arguments, which may be any part of speech. Two predicate structures can be matched, in the predicate matching step 136, by comparing their verbs in a verb (preposition) only match, where only the predicate part (usually a verb or a preposition) of two predicate structures is compared. A noun match may be performed if two predicate structures do not match in their predicate part, wherein their respective arguments are compared for matching nouns. A verb and noun match compares the whole predicate structure.

[0086] In order to precisely determine the information converted by the predicate structure, a multiple level modifier strategy 138 is implemented to adjust the weight for each factor that modifies the information converted to the predicate structure. Modifiers are defined based on a number of factors.

[0087] One factor is the predicate structure abstraction level. The predicate structure abstraction is represented by the predicate structure match type. A “Verb Only Match” is more abstract than a “Noun Match.” Similarly, a “Noun Match” is more abstract than a “Verb and Noun Match.” The parameters VerbOnlyMatchModifier, NounModifier and VerbNounMatchModifier are defined to adjust the weight of different predicate structure abstraction levels. The more abstract a match type, the smaller the weight it receives.

[0088] Another factor is concept proximity, which represents the ontological relationship between two concepts. Each concept in a parse tree can be represented as an integer. The smaller the difference between two concepts, the closer their ontological relationship is. The closer the ontological relationship, the higher the relevancy bonus. For two exactly matched concepts, the parameter ConceptExactMatchModifier adjusts the weight. The parameter ConceptProximityModifier adjusts the weight for two concepts that are not exactly matched. Each concept node in the ontology hierarchy tree has a unique integer identifier, and all of these numbers have the same number of digits. Thus, the value of identifier_digit_number represents the number of digits the integer identifier has, and the variable highest_order_difference_digit represents how many digits the difference between two concepts has. The modifier weight for ConceptProximityModifier is defined as ${ConceptExactMatchModifier} \times \left( 1 - \frac{highest\_order\_difference\_digit}{identifier\_digit\_number} \right)$
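A minimal sketch of this formula is given below. The function name `concept_match_score` and the default weight of 1.0 are placeholders chosen for illustration (they are not the values of Table 1), and counting the decimal digits of the absolute difference is one reading of "how many digits the difference between two concepts has".

```python
def concept_match_score(
    query_concept: int,
    doc_concept: int,
    identifier_digit_number: int,
    concept_exact_match_modifier: float = 1.0,  # placeholder weight
) -> float:
    """Score two concept identifiers by ontological proximity."""
    if query_concept == doc_concept:
        return concept_exact_match_modifier
    difference = abs(query_concept - doc_concept)
    # Number of decimal digits in the difference between the identifiers.
    highest_order_difference_digit = len(str(difference))
    return concept_exact_match_modifier * (
        1 - highest_order_difference_digit / identifier_digit_number
    )
```

Under this reading, two six-digit identifiers that differ only in the last digit score 5/6 of the exact-match weight, while identifiers that differ in the leading digit score zero.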

[0089] Sentence position is another factor, which postulates that sentences appearing early in a document may contain the title or abstract of a document, and predicates containing the title or abstract will have a higher information content. FrontLineModifier is defined for predicate structures representing one of the first ten sentences in a document.

[0090] Another factor is the degree of proper noun matching. This factor considers the similarity between two proper nouns. ProperNounExactMatchModifier is defined to modify the matching degree between two exactly matched proper nouns, and SymbolMatchModifier is defined to modify the matching degree between two proper nouns in which one is the symbol of the other.

[0091] Word stem is a factor that takes into consideration whether two words are from the same word stem. For example, the words “sleeps” and “sleepy” have the same word stem, “sleep.” SameStemModifier is defined to adjust the matching degree of two words having the same word stem.

[0092] Document size is a factor that takes into account the size of a document. A DocSizeModifier parameter is designed to prevent a large number of predicate occurrences in a document from over-weighting the matching score. A short document containing 10 instances of a query predicate structure will be given a higher weight than a much longer document with 10 instances of the same query predicate structure.

[0093] Table 1, shown in FIG. 3, sets forth examples of modifier names, default weights, and an explanation of each. The default weights shown are just examples to show the relative magnitude of each modifier. Actual weighting parameters are defined experientially.

[0094] The similarity comparison process implemented by the relevancy ranking component determines the similarity between a query and a document. The inputs to this algorithm are:

[0095] Query_predicateLibrary: a query predicate library structure representing an input query and containing all predicate structures generated by an ontological parser from parsing the input query,

[0096] Doc_predicateLibrary: a document predicate library structure containing the set of predicate structures representing the sentences in the natural language source document, and

[0097] Match_type: a representation of the predicate structure match type.

[0098] The output of this algorithm is a real number representing the similarity level between a query and a document.

[0099] FIG. 4 is a flow chart that describes how the predicate structure matching component determines a matching degree between two predicate structures. Two input predicate structures are compared in block 200. If they match exactly, the result is returned to the user in step 202. If the two input predicate structures do not match exactly, the degree or type of match is determined in step 204.

[0100] If a noun match is desired, the predicate parts are extracted in step 206 to provide two predicates 208. If the two predicates 208 match in step 210, the result is returned in step 212. If the two predicates 208 do not match, the arguments are extracted from the predicate structures in step 214, and the arguments are matched in step 216. Matching arguments are sent to an argument matching unit and to the degree of matching calculation step 218.

[0101] If only a verb match is desired, the predicate parts are extracted in step 220 to provide two predicates 222. If the two predicate parts 222 match in step 224, the matching predicates are sent to a predicate matching unit in step 225 and to the degree of matching calculation step 218.

[0102] If both verb and noun matches are desired, the verbs are matched in step 226, and steps 220 to 225 are followed. The nouns are then matched in step 228, and steps 206 to 218 are followed. Furthermore, after the noun matching step 228 and the verb matching step 224, the degree of matching is calculated in step 219.

[0103] The procedures to determine the degree to which two predicate structures match are described below. A determination is made whether the two predicate structures are an exact match, namely whether the query predicate structure and the document predicate structure match exactly. If there is an exact match, then the matching degree is set to equal the predicate Structure Exact Match Modifier. If the two predicate structures do not match exactly, then a determination is made of the matching degree based on the input predicate structure match type.

[0104] If only the verbs match, the value S_(verb) is set to equal verb_only_match between the query predicate structure and the document predicate structure. The matching_degree is then set equal to the product of the VerbOnlyMatchModifier and S_(verb).

[0105] If only the nouns match, the value S_(noun) is set equal to noun_match between the query predicate structure and the document predicate structure. The matching_degree is then set equal to the product of the NounOnlyMatchModifier and S_(noun).

[0106] If both the nouns and verbs match, the values S_(verb) and S_(noun) are set to equal verb_only_match and noun_match, respectively, between the query predicate structure and the document predicate structure. The matching_degree is then set equal to the product of the NounOnlyMatchModifier and S_(noun) plus S_(verb).

[0107] When only the verbs of the two predicate structures match, the matching function is as follows. First, the predicate part is extracted from each predicate structure to generate two predicate objects, query_predicate and doc_predicate. Second, the two extracted predicate objects are sent to a predicate match unit, which returns a matching score of the two predicates.

[0108] When only the nouns of the two predicate structures match, the matching function is as follows. First, the predicate part from each predicate structure is extracted and two predicate objects are generated, query_predicate and doc_predicate. Second, a determination is made as to whether the two predicate objects exactly match. If the two predicate objects exactly match, the score is set equal to the predicateExactMatchModifier. Otherwise, arguments are extracted from each predicate structure and two argument lists, query_argument_list and doc_argument_list, are generated. The two argument lists are then sent to an argument matching unit, and a matching score for the two argument lists is returned.

[0109] FIG. 5 is a flow chart showing how the predicate matching component determines the degree to which two predicates match. A determination is made in step 300 whether the two predicates exactly match. If the answer is yes, the score is set to the predicateExactMatchModifier and returned in step 302. If the two predicates do not match exactly, in step 304 a determination is made as to whether the two predicates are from the same stem. If the two predicates are from the same stem, the score is set to the SameStemModifier and returned in step 306. If the two predicates are not from the same stem, a concept is extracted from each predicate object in step 308, and a pair of concepts 309, query_concept and doc_concept, are generated. The two concepts are sent to a Concept Matching Unit, and a concept matching score is returned, where the score equals match_concept (query_concept, doc_concept).
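The decision sequence of FIG. 5 can be sketched as follows. The stemmer and the concept matcher are passed in as callables because their internals are not given here; the parameter names `stem` and `match_concept`, the use of plain strings for predicates, and the default weights are all assumptions for illustration rather than values from Table 1.

```python
from typing import Callable


def match_predicates(
    query_predicate: str,
    doc_predicate: str,
    stem: Callable[[str], str],                  # hypothetical stemmer
    match_concept: Callable[[str, str], float],  # hypothetical concept matcher
    predicate_exact_match_modifier: float = 1.0,  # placeholder weight
    same_stem_modifier: float = 0.5,              # placeholder weight
) -> float:
    """Exact match first, then same stem, then fall back to concept matching."""
    if query_predicate == doc_predicate:
        return predicate_exact_match_modifier
    if stem(query_predicate) == stem(doc_predicate):
        return same_stem_modifier
    # Neither test succeeded: delegate to the concept matching unit.
    return match_concept(query_predicate, doc_predicate)
```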

[0110] FIG. 6 is a flow chart that describes how the concept matching component determines the degree to which two concepts match. In step 310 a determination is made whether the two input concepts 309 match exactly. If the result of step 310 is positive, the score is set equal to the ConceptExactMatchModifier and returned in step 312. If the result of step 310 is negative, the difference between the two concepts is calculated in step 314, with the highest order digit of the difference between the numerical representation of the query_concept in the ontological tree structure and the numerical representation of doc_concept in the ontological tree structure. The ConceptProximityModifier is calculated in step 316 by dividing the ConceptExactMatchModifier by the difference between the two concepts. The resulting score is returned in step 318.

[0111] FIG. 7 is a flow chart describing how the proper noun matching component determines a matching degree between two proper nouns. A determination is made whether the two input proper nouns exactly match in step 320. If the result of step 320 is positive, the score is set equal to the properNounExactMatchModifier and returned in step 322. If the result of step 320 is negative, a determination is made whether either proper noun is a symbol for the other proper noun in step 324. If the result of step 324 is positive, the score is set equal to SymbolMatchModifier and returned in step 326. If the result of step 324 is negative, a score of 0 is returned in step 328.
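A minimal sketch of the FIG. 7 flow follows. How the system knows that one proper noun is a symbol for another is not described here, so the `symbols` lookup table (full name to symbol) is a hypothetical stand-in, and the default weights are placeholders rather than the values of Table 1.

```python
from typing import Mapping


def match_proper_nouns(
    query_noun: str,
    doc_noun: str,
    symbols: Mapping[str, str],  # hypothetical table: full name -> symbol
    proper_noun_exact_match_modifier: float = 1.0,  # placeholder weight
    symbol_match_modifier: float = 0.8,             # placeholder weight
) -> float:
    """Exact match, then symbol match in either direction, else zero."""
    if query_noun == doc_noun:
        return proper_noun_exact_match_modifier
    if symbols.get(query_noun) == doc_noun or symbols.get(doc_noun) == query_noun:
        return symbol_match_modifier
    return 0.0
```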

[0112] FIG. 8 is a flow chart that describes how the argument matching component determines a matching degree between the arguments of two predicate structures. Iteration is performed through both lists as long as both have more arguments in step 330. If both lists have no more arguments, then the current computed value of matching degree is returned in step 332. If both lists have more arguments, one argument from each list is retrieved to generate two arguments, a query argument and a document argument, in step 334. The two arguments 335 are checked to determine if they match exactly in step 336. If the two arguments match exactly, a matching degree is calculated in step 342, and an already processed query argument is deleted from the query argument list in step 344. The process then returns to step 330 to determine if both lists still have more arguments.

[0113] If the result of step 336 is negative, a determination is made whether the two arguments are both proper nouns in step 338. If both arguments are proper nouns, the concepts are sent to the proper noun matching unit in step 339. If the arguments are not both proper nouns, concepts are extracted from each argument, and two concepts are generated in step 340. The two concepts are then sent to a concept matching unit in step 341.
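The argument matching loop of FIG. 8 might be sketched as below. Several points are assumptions made only so the example runs: arguments are paired position by position, the per-pair scores are summed into the matching degree (the text does not state how they are combined), and the `Argument` type and the two matcher callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Argument:
    text: str
    is_proper_noun: bool = False


def match_arguments(
    query_args: List[Argument],
    doc_args: List[Argument],
    match_proper_nouns: Callable[[str, str], float],  # hypothetical unit
    match_concepts: Callable[[str, str], float],      # hypothetical unit
    exact_match_score: float = 1.0,                   # placeholder weight
) -> float:
    """Accumulate a matching degree while both lists still have arguments."""
    matching_degree = 0.0
    # zip stops as soon as either list runs out of arguments.
    for query_arg, doc_arg in zip(query_args, doc_args):
        if query_arg == doc_arg:
            matching_degree += exact_match_score
        elif query_arg.is_proper_noun and doc_arg.is_proper_noun:
            matching_degree += match_proper_nouns(query_arg.text, doc_arg.text)
        else:
            matching_degree += match_concepts(query_arg.text, doc_arg.text)
    return matching_degree
```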

[0114] The third major component in the relevancy ranking and clustering method and system is the document clustering component. In any sufficiently extended document space, it is likely that generally worded queries will result in far too many exact or nearly exact matches. It is commonly found that users who are presented with a long list of matches from a document retrieval process rarely look past the first 25 or 30 such matches. If the list exceeds that length, all other matches will be ignored. Yet it is also true that a query worded with too much generality can easily retrieve many times that number of documents from even a moderately large search space, all of which conceivably provide an exact predicate match for the user query. In these cases, the document clustering component is used to further distinguish the found documents and identify the exact ones that provide the requested information.

[0115] The document clustering component comprises a self-organizing neural network that self-organizes based on the set of documents returned by the relevancy ranking unit as being exact or near-exact matches to the user query. The self-organization results in the identified documents being clustered based on their patterns of predicate pairs as discovered during the search process. The document clustering component then compares the user query predicate structure to the resulting self-organized map, identifying clusters of documents that are most likely to fit the user's intent. The document clustering component provides a feedback mechanism by which users can determine if the identified cluster(s) of documents are good fits. This process also rapidly focuses on the correct cluster, whether or not the originally chosen cluster is correct.

[0116] The purpose of the document clustering component is to identify the exact documents desired by the user with the greatest possible confidence and the minimum possible time and effort on the part of the user. In effect, the document clustering component organizes returned documents into similarity clusters, which clusters are themselves organized by similarity. This organization and clustering process is performed automatically, without the need for humans to determine content or topic for the clustered documents.

[0117] Users may direct attention to one or a subset of clusters, and the document clustering component can thus rapidly and efficiently return only those documents desired, even when the original user query was too broadly worded to accurately focus the search effort.

[0118] The document clustering component operates in two modes. In the self-organization mode it adapts to the collection of identified documents matching a specific search effort. In the cluster retrieval mode it identifies appropriate cluster(s) of documents and returns them to the user, thus refining the search effort to more specifically respond to the user's needs.

[0119] Document clustering using the document and query vectors produced by the previously described vectorization techniques achieves several key advantages. First, it takes advantage of the basis for proximal concept co-occurrence in a more effective way than looking for simple word repetitions. Often authors of documents try to avoid too much repetition of the same term because human readers perceive such a writing style as repetitious and boring. Thus, a document dealing with housing might refer to “abodes,” “houses,” and “residences” within a single sentence or paragraph. Simple proximal word co-occurrence does not acknowledge that these are virtually the same concept. The use of ontology-based conceptual predicates identifies all such terms as identical or with numerical representations that vary only slightly from each other. Thus the notion of proximal concept co-occurrence provides much greater power than the simpler proximal word co-occurrence used in other systems.

[0120] The second key advantage is that frequency of a concept is clearly related to its importance within a discussion. Because the normalized query vector, in effect, repeats the query concept or concepts many times, it can identify similar concepts located throughout the normalized document vectors. As a result, documents that have a high frequency of occurrence of the query concept are more likely to be returned by the document clustering component than documents that mention the query concept only rarely.

[0121] A further advantage of this vectorization technique is that it reflects the reality that a given query concept may appear in many possible positions within the document, not merely in the first sentence, or even the first paragraph. The replication of the query concept throughout the query vector in effect checks for this concept in almost every possible location within the document. There thus is no penalty to documents that have the relevant discussion further down in their text. Those documents that are the most relevant (documents containing discussions of the query concepts more frequently) are returned.

[0122] These two modes make use of any of several self-adaptive neural network structures. A neural network is a computational system that uses nonlinear, non-algorithmic techniques for information processing. Although the neural network structure specified herein is illustrative of the type of neural network architecture and learning algorithm that may be used in this component, the scope of the present invention is not intended to be limited to the specific embodiment disclosed herein, as alternative embodiments will be obvious to one skilled in the art.

[0123] A neural network generally consists of a large number of simple processing elements, called neurodes 400 herein. FIG. 9 illustrates a single neurode 400. The neurodes 400 receive input signals along one or more individually weighted and individually adjustable incoming connections, and generate a single, nonlinear response signal which is transmitted to one or more other neurodes, or to the outside world, via a separate set of individually weighted connections.

[0124] Typically, the neurodes 400 of a neural network are organized into layers as illustrated in FIG. 10, with the primary communication among neurodes 400 being interlayer; i.e., the neurodes 400 of the first, or input, layer 402 transmitting their outputs to the inputs of the neurodes 400 of the second layer 404, in the middle processing layer 406, over individually weighted connections, and so on. The effectiveness of the transmission of signals into a neurode 400 depends on the weight of the connections over which the signals travel. A positively weighted (excitatory) connection tends to increase the activity of the receiving neurode 400 and thus increase the resulting output from that receiving neurode 400. A negatively weighted (inhibitory) connection tends to decrease the activity of the receiving neurode 400, and thus decrease the resulting output from that receiving neurode 400. The process of training the neural network consists of establishing an appropriate set of weights on these connections so that the overall response of the network to input signals is appropriate. The neural network emits output signals through the output layer 408.

[0125] FIG. 11 provides a block diagram of the process of clustering documents using the document clustering component. As shown in FIG. 11, the process includes three steps, namely, document vectorization 500 of the documents in the set 502 to be clustered, query vectorization 510 of the original user query 512, and training the document clustering component on the vectorized documents, user interaction and feedback to determine the appropriate cluster(s) of documents to return to the user 520.

[0126] Once the document set 502 is vectorized, training can begin. (As noted in the previous section, the vectorization step is normally done at the time the document is placed in the document predicate storage 106.) Training occurs using a self-organizing feature map 504 of arbitrary size. Typically, the size of the map 504 will be determined based on (a) the number of categories desired as determined by pre-established system parameters; (b) the size of the document set 502; and (c) the desired granularity of the categories. The number of neurodes in the feature map 504 can either be fixed as a system parameter, or dynamically determined at the time the map is created.

[0127] Once the feature map is prepared, the final step is to determine which cluster or clusters should be returned to the user, and which document within that cluster should be specifically used as an example. As noted previously, several techniques exist to make these decisions, and the best one will be predicated on the needs of the specific application. One technique is discussed herein to illustrate how this process is performed; however, it is not meant to limit the scope of the present invention.

[0128] One useful method to determine the cluster and document is to make use of the query vector produced in the vectorization process. The normalized query vector is applied to the feature map, in the same manner as the document vectors were. The neurode with the weight vector that produces the largest dot-product computation with this query vector is the “winner.” That neurode has a list of associated documents produced during the final processing stage of training. Those documents (in a very large document set) are the cluster most closely associated with the concepts in the query. Thus, that cluster is the one that should be presented to the user. In a smaller document set, the documents from that neurode and its immediately neighboring neurodes constitute the winning cluster.
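The winner selection described here can be sketched as follows. The sketch assumes a one-dimensional layer in which a neurode's immediate neighbors are the indices on either side of it, with no wraparound at the ends; the names `find_winner`, `winning_cluster` and `docs_by_neurode` are illustrative.

```python
from typing import Dict, List, Sequence


def find_winner(query_vector: Sequence[float],
                weight_vectors: Sequence[Sequence[float]]) -> int:
    """Index of the neurode whose weights give the largest dot product."""
    scores = [sum(q * w for q, w in zip(query_vector, weights))
              for weights in weight_vectors]
    return max(range(len(scores)), key=scores.__getitem__)


def winning_cluster(query_vector: Sequence[float],
                    weight_vectors: Sequence[Sequence[float]],
                    docs_by_neurode: Dict[int, List[str]],
                    include_neighbors: bool = False) -> List[str]:
    """Documents attached to the winner, optionally with its neighbors."""
    winner = find_winner(query_vector, weight_vectors)
    indices = [winner]
    if include_neighbors:
        # Smaller document sets: also take the neurodes on either side.
        indices = [i for i in (winner - 1, winner, winner + 1)
                   if 0 <= i < len(weight_vectors)]
    documents: List[str] = []
    for i in indices:
        documents.extend(docs_by_neurode.get(i, []))
    return documents
```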

[0129] It is also possible to present the user with several choices of clusters by retrieving the cluster that the query most closely matches, and one or more clusters represented by neurodes at greater distance from the winning neurode. For example, if the system is set to return four clusters for the user to select among, the winning neurode defines the first cluster, and the remaining three clusters are represented by neurodes approximately equally spaced throughout the physical network. Clearly, any desired number of cluster choices can be selected, up to the total number of neurodes within the self-organizing layer, depending on system application and utility. The result, no matter how many clusters are chosen, is a small set of documents that effectively spans the conceptual space of the document set.
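One way to pick approximately equally spaced neurodes is sketched below. It assumes neurodes are indexed 0 to M-1 along the layer and that spacing wraps around past the last index; the text does not specify the spacing rule, so both the wraparound and the name `choose_cluster_neurodes` are assumptions.

```python
from typing import List


def choose_cluster_neurodes(winner: int, layer_size: int, num_clusters: int) -> List[int]:
    """Return the winner plus neurodes spaced evenly around the layer."""
    if not 0 <= winner < layer_size:
        raise ValueError("winner must be a valid neurode index")
    if not 1 <= num_clusters <= layer_size:
        raise ValueError("num_clusters must be between 1 and layer_size")
    chosen = []
    for k in range(num_clusters):
        # Integer spacing; offsets are distinct because num_clusters <= layer_size.
        offset = (k * layer_size) // num_clusters
        chosen.append((winner + offset) % layer_size)
    return chosen
```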

[0130] As shown in FIG. 11, the document clustering component makes use of the document vectors previously generated by the document vectorization unit 130, and the previously generated user query vector or vectors generated by the query vectorization unit 134. With these inputs, the document clustering component performs the steps of self-organization of the document clustering component, and cluster retrieval from the trained network.

[0131] In one implementation of the self-organizing network described, the network's key processing layer, called the self-organizing layer, includes connections not only to other layers, but also to other neurodes within the self-organizing layer. The activity of the self-organizing layer is mediated by a secondary competition effect.

[0132] The intralayer connections (i.e., those from a self-organizing layer neurode to other neurodes within that layer) mediate the competition. Weights on the intralayer connections between two arbitrary neurodes vary with the physical distance between the neurodes so that immediate physical neighbors generate positive, stimulating signals (i.e., over positively weighted connections), while neighbors further apart generate negative, inhibitory signals (i.e., over negatively weighted connections). Those neurodes at the farthest distances (such as neurodes along the physical edges of the layer of neurodes) provide a slight positive stimulation. The net effect is that an input signal arriving at the self-organizing layer results in an intralayer competition that ensures a single, best-match neurode wins. Only that winning neurode generates an output signal to pass on to the following layer of neurodes. In essence, the activity of all other neurodes is damped by the competition with the best-fit winning neurode.

[0133] Alternative implementations of such a self-organizing map as a computer program do not require exact simulations of such competition; it is enough that the end result is produced. For each input signal pattern, the only output signal comes from the single neurode that best matches that input pattern as determined by the dot product of the input pattern vector (normalized) and the corresponding (initially normalized) weight vector of the neurode.

[0134] FIG. 12 illustrates a typical configuration of the feature map 504. The intra-layer connections shown are present to illustrate the neighbor concept, but are not actually used as part of the training process. For simplicity and clarity in the diagram, a 7-dimensional input vector 506 to an input layer 508 is shown, but in actual practice, the dimensionality of the input vector 506 is more likely to be in the range of 50 to 100. Dimensionality is a parameter that can be established based on specific application needs.

[0135] The overall geometry of the self-organizing layer is one of a continuous strand of neurodes. Immediate neighbors of each neurode are those on either side of it according to the virtual intra-layer connections shown. The size of the neighborhood around each neurode is a parameter that is varied in the course of training. Typically, the initial size of the neighborhood is fairly large so that as much as 25-50% of the total self-organizing layer constitutes the neighborhood of any given neurode. As training proceeds, this neighborhood size is rapidly lowered until it reaches 1 or 0, at which point only the single winning neurode adjusts its weights.

[0136] The overall training process for a given set of document vectors, $D = \{I_{1}, I_{2}, \ldots, I_{n}\}$, includes initializing the self-organizing layer of the network. Initialization includes determining the size of the self-organizing layer, either from a fixed system parameter, or from dynamic considerations such as the total number of document vectors within D. For the following discussion, the size of the self-organizing layer is referred to as M.

[0137] An initial set of weight vectors is then established for each neurode in the self-organizing layer. These are the weights associated with the connections between the input layer and the self-organizing layer. Note that the dimensionality of these weight vectors is always exactly the same as the dimensionality of the document vectors since there is always exactly one input connection for each element of the document vectors. For the initial weight set, the M weight vectors are either set to random weights or to the first M input vectors in D.

[0138] When the total number of document vectors in D is very large, the second choice may be most appropriate, since the second choice guarantees that every neurode accurately reflects at least one input vector. However, in smaller document vector sets, the second choice can unfairly bias the self-organizing map toward those documents that (by chance) happen to occur early in the document set. The random process is safer, but somewhat more computationally expensive; setting the weights to members of the document vector set is computationally cheap but may impose some unwanted bias on the feature map.
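
The two initialization choices can be illustrated with a minimal sketch. The helper names `normalize` and `init_weights`, the `mode` and `seed` parameters, and the uniform range used for the random weights are all assumptions introduced here for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length > 0 else list(v)


def init_weights(docs, m, mode="random", seed=0):
    """Build M initial weight vectors with the same dimensionality as docs.

    mode="first":  copy the first M document vectors (cheap, may bias the map).
    mode="random": draw random weights (assumed uniform in [-1, 1]).
    """
    k = len(docs[0])  # one weight per element of a document vector
    if mode == "first":
        if len(docs) < m:
            raise ValueError("need at least M document vectors")
        return [normalize(d) for d in docs[:m]]
    rng = random.Random(seed)
    return [normalize([rng.uniform(-1.0, 1.0) for _ in range(k)])
            for _ in range(m)]
```

Both branches normalize the result so that the later dot-product comparison operates on vectors of equal length.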

[0139] Once the initialization is complete, the actual training process starts. An input vector $I_{n}$ is applied to each neurode in the self-organizing layer. The closeness of the weight vector of each neurode to the input vector is determined by computing the dot product of the two vectors. Next, the neurode with the largest result of the dot-product computation is determined and is declared the “winner.” The dot product is computed as: $I_{n} \cdot W_{m} = \sum_{i=1}^{k} I_{n,i} W_{m,i}$

[0140] In this equation, the nth input vector and the mth neurode's weight vector are dotted. The dimensionality of each of these vectors is k (i.e., there are k elements in each of the input vector and the weight vector).

[0141] The formula for the dot product of an input vector I and a weight vector W is:

$I \cdot W = \sum_{i} I_{i} W_{i} = |I| \, |W| \cos \alpha$

[0142] where the summation is taken over all elements i of the two vectors I and W, |I| is the length of the vector I (with corresponding meaning for |W|), and $\alpha$ is the angle in n-dimensional weight-space between the two vectors.
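
Winner selection by largest dot product can be sketched as below. The function names `dot` and `find_winner` are illustrative assumptions; ties are resolved in favor of the lowest neurode index, a choice the specification does not address.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def find_winner(weights, x):
    """Return the index m of the neurode whose weight vector W_m gives
    the largest dot product with the input vector x."""
    return max(range(len(weights)), key=lambda m: dot(weights[m], x))
```

With normalized vectors, the largest dot product corresponds to the largest cosine, i.e., the smallest angle between the input and weight vectors.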

[0143] The weight vectors of the winner and each of its current neighbors, as determined by the current (and dynamically changing) neighborhood size, are then modified as follows:

$W_{m}^{new} = W_{m}^{old} + \beta \, (I_{n} - W_{m}^{old})$

[0144] where $\beta$ is a learning parameter between 0.0 and 1.0. Typically, this parameter is on the order of 0.25 or less, though its value may also be dynamically changed during the course of the training.
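
A minimal sketch of this update applied to the winner and its neighbors follows. The function name `update_weights`, the integer `radius` parameter, and the decision to clip the neighborhood at the two ends of the strand are assumptions made for illustration.

```python
def update_weights(weights, winner, x, beta, radius):
    """Move the winner and its neighbors a fraction beta toward input x.

    Applies W_new = W_old + beta * (x - W_old) to every neurode whose
    index is within `radius` of the winner (assumed clipped at the ends).
    The list `weights` is modified in place and also returned.
    """
    lo = max(0, winner - radius)
    hi = min(len(weights) - 1, winner + radius)
    for m in range(lo, hi + 1):
        weights[m] = [w + beta * (xi - w) for w, xi in zip(weights[m], x)]
    return weights
```

A radius of 0 updates only the winning neurode, which matches the end state of training described for the neighborhood size.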

[0145] The process then continues to the next input vector, repeating the foregoing steps. Training proceeds and the neighborhood size is decreased until the complete document set is processed. For very large document sets, a single pass through the data may suffice. For smaller sets, the training process can be iterated through the document set as necessary.
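
Putting the foregoing steps together, the training loop might look like the sketch below. The function name `train_som`, its parameters, and in particular the linear shrinking of the neighborhood radius are assumptions; the specification states only that the neighborhood starts large and is rapidly lowered.

```python
def train_som(docs, weights, beta=0.25, start_frac=0.25, passes=1):
    """Train a one-dimensional self-organizing layer in place.

    docs:       list of (normalized) document vectors
    weights:    list of M initial weight vectors
    start_frac: initial neighborhood radius as a fraction of M (assumed)
    passes:     number of iterations through the document set
    """
    m = len(weights)
    total = passes * len(docs)
    start_radius = int(start_frac * m)
    step = 0
    for _ in range(passes):
        for x in docs:
            # Assumed schedule: radius shrinks linearly toward 0.
            radius = int(start_radius * (1 - step / total))
            # Winner = neurode with the largest dot product with x.
            winner = max(range(m),
                         key=lambda j: sum(a * b for a, b in zip(weights[j], x)))
            # Update the winner and its neighbors (clipped at the ends).
            for j in range(max(0, winner - radius), min(m, winner + radius + 1)):
                weights[j] = [w + beta * (xi - w) for w, xi in zip(weights[j], x)]
            step += 1
    return weights
```

Setting `passes` above 1 corresponds to iterating through a smaller document set several times.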

[0146] Because this training is computationally extremely simple, it can be performed very quickly. Furthermore, the training process can begin even as the search process continues since documents can be incrementally added to the document set without loss of performance.

[0147] The foregoing process produces a self-organized feature map that has clusters of neurodes representing conceptual clusters of the documents in the feature set. Neurodes that represent documents that are conceptually near will themselves be physically near within the feature map (where “nearness” within the feature map is determined by the neighborhood distance within the network). One additional pass through the input vectors can now be done without doing further modifications to the feature map. The purpose of the final pass is to make a list of which documents correspond to which winning neurode in the final map. Thus, there is an internal list associated with each neurode that notes for which documents that neurode is the “winner.”
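
The final, non-modifying pass can be sketched as follows. The function name `assign_documents` and the use of list positions as document identifiers are assumptions for illustration.

```python
from collections import defaultdict


def assign_documents(docs, weights):
    """Map each neurode index to the list of documents it wins.

    Documents are identified by their position in `docs` (an assumed
    convention). The weight vectors are read but never modified.
    """
    clusters = defaultdict(list)
    for doc_id, x in enumerate(docs):
        winner = max(range(len(weights)),
                     key=lambda j: sum(a * b for a, b in zip(weights[j], x)))
        clusters[winner].append(doc_id)
    return dict(clusters)
```

Neurodes that win no documents simply do not appear in the returned dictionary.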

[0148] The single winning neurode represents the best match between the input signal and the currently organized network's set of weight vectors. In n-dimensional weight-space, the input vector and the weight vector of the winning neurode most nearly point in the same direction (i.e., have the maximum cosine of the angle between the two vectors). The normalization of the input vectors (i.e., the stored document vectors) and initial weight vectors is important because this implies that the lengths of the corresponding input and weight vectors are held constant at an arbitrary value, usually, though not necessarily, 1.0. This also implies that the normalization process used on the document and query vectors is the same normalization process that must be used on the weight vectors in the neural network. Additionally, issues of dimensionality and scale are avoided via the normalization procedure.

[0149] In a typical implementation of the neural network architecture, the winning neurode and, possibly, some of its immediate physical neighbors adjust their weight vectors according to any of several learning laws. The simplest of these is

$W_{new} = W_{old} + \beta \, (I - W_{old})$

[0150] where $\beta$ is a parameter that may be varied during the course of the self-organization mode, but is in all cases between the values of 0.0 and 1.0. W is the weight vector being modified from its old values to its new values and I is the input vector for a particular input sample. The effect of such weight adjustment is to nudge the weight vectors of those neurodes that are adjusting their weights to positions fractionally closer to the position of the input vector in n-dimensional weight-space, where the fraction is determined by the value of $\beta$. If one visualizes the input vector and the various weight vectors as being distributed around a normalization hypersphere in n-dimensional weight-space, the weight adjustment moves the weight vector along a chord that stretches between the initial position of the weight vector and the input vector.
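
As a worked illustration with made-up numbers (not taken from the specification), consider the two-dimensional unit vectors $W_{old} = (1, 0)$ and $I = (0, 1)$ with a learning parameter of 0.25. The simplest learning law gives $W_{new} = (1, 0) + 0.25 \, ((0, 1) - (1, 0)) = (0.75, 0.25)$. This point lies on the straight chord joining $W_{old}$ and $I$, one quarter of the way along it, and its length is $\sqrt{0.75^{2} + 0.25^{2}} = \sqrt{0.625} \approx 0.79$, so it falls slightly inside the unit hypersphere rather than on its surface.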

[0151] The determination of which, if any, of the winning neurode's physical neighbors should adjust their weight vectors according to this scheme is part of the self-organizing process. Typically, the initial number of the adjusting neighbors is a substantial fraction of the network as a whole, thus ensuring that all neurodes participate in the self-organizing process. As self-organization proceeds and additional input vectors are presented to the network, this number is reduced until only the winning neurode adjusts its weight vector.

[0152] With a sufficiently large collection of documents on which to self-organize, it is possible for the self-organization step to proceed in an incremental fashion, as potentially matching documents are identified and added to the document set. If the set of documents is moderate in number (though still too many to return to the user directly), the system can perform the self-organization step by iterating through the set of returned document vectors until an appropriate level of clustering is achieved.

[0153] The total number of clusters possible within the neural network is limited to the total number of neurodes in the self-organizing layer. This is a definable parameter and may vary depending on specific application needs.

[0154] Unlike other document search and retrieval systems, the present system considers the overall pattern of predicate structures across substantial portions of the document to determine the overall meaning of the document. The predicate structures derived from the ontological parser are combined into a moving window of fixed but arbitrary size, with each predicate structure providing a fixed number of input vector elements. The resulting input vector provides an encoding of the relationship of predicate structures to each other within the document. This pattern of predicates permits the document clustering component to self-adapt to the overall meaning of the documents being learned.
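
The moving-window construction can be sketched roughly as below. This is a hypothetical illustration: the function name `window_vectors`, the representation of each predicate structure as a fixed-length tuple of numbers, and the step of one structure between successive windows are all assumptions.

```python
def window_vectors(structures, window):
    """Concatenate each run of `window` consecutive predicate structures
    into one input vector.

    structures: list of equal-length numeric tuples, one per predicate
                structure (assumed encoding)
    window:     number of predicate structures per input vector
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len({len(s) for s in structures}) > 1:
        raise ValueError("every structure must supply the same number of elements")
    return [[e for s in structures[i:i + window] for e in s]
            for i in range(len(structures) - window + 1)]
```

Each resulting vector has `window` times the per-structure element count, so the window size directly fixes the dimensionality of the input vectors.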

[0155] Because the documents used for training consist of those documents returned by the ontological parsing system as matches for the specified query, the training results in a clustering of the documents into “more similar” and “less similar” categories because the self-organizing neural network after training has the following characteristics:

[0156] Documents with similar global content (as opposed to similar individual predicate structures) are represented within the network by neurodes that are physically near each other;

[0157] Documents with dissimilar global content are represented within the network by neurodes that are physically far from each other;

[0158] The clustering of neurode weight vectors approximately mimics the clustering of the input vectors representing the various documents used in training. Thus, documents from large, complex clusters will have many weight vectors representing that space, resulting in finer detail of representation. Documents that are relatively rare in the training space will have fewer corresponding weight vectors representing them.

[0159] The clustering of neurodes approximately mimics the probability distribution function of the set of document vectors.

[0160] Once the Document Clustering Component has produced a neural network trained to distinguish the documents returned by the earlier search effort, it is ready to be used to determine the exact document(s) needed by the user. This is done in cluster retrieval mode by presenting users with one or, possibly, a small selection of documents that represent the corresponding document clusters. Such documents are considered by the system as being “typical” of the documents within that cluster. Although additional processing of the clusters can be performed to further refine the separation, it is not necessary in most instances.

[0161] Users can determine whether the presented sample document or documents are similar or not similar to the needed documents. Based on that selection, the appropriate document(s) are provided. In the case where the user indicates a presented document is similar to desired documents, documents within the selected document's cluster are returned to the user. When the user indicates a presented document is dissimilar to desired documents, one or more sample documents from one or more clusters far from the presented document are provided for similar rankings. Because the document clustering component has positioned the sample documents within similarity clusters, and because those similarity clusters are themselves arranged in order of similarity, it is a near-certainty that appropriate documents are returned to the user within one, or at most a few, iterations of this user selection procedure. The latter case will occur only when an extremely large set of documents is returned by the search effort.

[0162] The determination of the cluster from which to select a sample can be performed in a variety of ways, based on computational needs and other user-defined parameters. Typical methods include, but are not limited to, returning one sample document from each cluster, returning one sample document from the cluster most closely matched by the user's original query, returning one sample document from only the largest cluster, and returning one sample document from a randomly selected cluster.

[0163] Any of the foregoing methods can be used, along with other suitable methods, depending on system performance requirements and user preferences.

[0164] The determination of the specific sample document to select from a cluster can also be made in a variety of ways, based on specific user and system needs. Typical choices include, but are not limited to, returning a random document from the cluster, and returning the document closest to the center of the cluster.

[0165] Returning a random document is computationally simple and would be applicable in situations where performance is a concern. The second method involves computing the center of the cluster of documents using any of several well-understood mathematical formulations, and then returning that document or documents which most closely match that center. The second method has the advantage of returning a “typical” document that accurately represents the contents of the documents in the cluster. The second method has the disadvantage of requiring considerably more computational overhead to determine the center, and then more computation to determine the document or documents closest to that center. The tradeoff is between efficiency and precision.
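
The second method can be sketched as below, assuming one particular formulation among the several possible: the center is taken to be the arithmetic mean of the cluster's document vectors, and closeness is measured by squared Euclidean distance. The function name `closest_to_center` is likewise an assumption.

```python
def closest_to_center(vectors):
    """Return the index of the document vector nearest the cluster center.

    Assumptions: the center is the element-wise mean of the vectors,
    and nearness is measured by squared Euclidean distance.
    """
    n = len(vectors)
    k = len(vectors[0])
    center = [sum(v[i] for v in vectors) / n for i in range(k)]

    def sqdist(v):
        return sum((a - c) ** 2 for a, c in zip(v, center))

    return min(range(n), key=lambda i: sqdist(vectors[i]))
```

The extra cost relative to a random pick is one pass to compute the center and a second pass to compare every document against it.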

[0166] In an alternate method of query, users who are doing preliminary scope-of-the-subject searches can request sample documents from each cluster within the original query hits. Thus, if the documents returned from the ontological parser fall into P clusters, the system can provide a sample document from each of those clusters to give users a sense of the scope of information about a query that is available within the original search space. The user may then choose to focus on a specific cluster of documents and ask for those only, or may request that a subset of clusters be returned, or may request that all documents be returned.

[0167] Presentation of the sample document to the user is done interactively. For each chosen cluster, one sample document is selected from that cluster. The sample document can be a random choice, or selected by any other means from the list of documents associated with that neurode. For example, if four clusters are selected, the first document on each neurode's list might be selected as the example document for that cluster. Information about each document may be presented to the user as:

[0168] A thumbnail image of the document;

[0169] A full-size version of the document in a new window;

[0170] The first few lines or sentences of the document;

[0171] An embedded title and description of the document as encoded within the document itself; or

[0172] Any other appropriate summary form.

[0173] The user can then request more documents like the sample document, documents very different from the sample document, or documents like the sample document, but with some new query attached.

[0174] In the first case, the set of documents represented by that cluster is returned to the user. In the second case, documents from a cluster or clusters far from the original cluster (as represented on the self-organizing map) are presented in a repeat of the above interactive process. And in the final case, a new query vector is produced that combines the original query concepts with the added information from the revised query. This query vector is applied to the same self-organizing map to identify a new cluster of documents to present, in a similar process as described above.
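
For the second case, choosing a cluster far from the original one might be sketched as follows. The function name `farthest_cluster`, the use of the difference in neurode indices as the map distance, and the lowest-index tie-break are assumptions; the `clusters` argument is assumed to map neurode indices to lists of document identifiers.

```python
def farthest_cluster(current, clusters):
    """Return the index of the non-empty cluster farthest from `current`
    on the one-dimensional map, or None if no other cluster has documents.

    Assumptions: distance is the absolute difference of neurode indices;
    ties go to the lower index.
    """
    candidates = [n for n, docs in clusters.items() if docs and n != current]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (abs(n - current), -n))
```

A sample document from the returned cluster would then be presented to the user in a repeat of the interactive process.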

[0175] The relevancy ranking and clustering method and system can be applied in a wide variety of information retrieval applications by combining the three elements of the method and system in different orders and combinations. The single component that should always be applied first is the vectorization unit because it generates document and query vectors that are used both by the relevancy ranking unit and by the document clustering component. Depending on the goals of the application, however, the relevancy ranking unit and the document clustering component can be applied in various orders and combinations. FIGS. 13 through 16 illustrate the possible combinations for these components.

[0176] In FIG. 13, the relevancy ranking and clustering method and system is applied as part of an information retrieval or search engine. Known parsed documents 120 are converted into vector representations by the document vectorization unit 130 and stored in a document predicate storage unit 124, along with the parsed predicate structures. When the user enters a query, it is similarly converted by the query vectorization unit 134 into a vector representation, which is stored in query predicate storage 126. The relevancy ranking unit 128 then uses these various representations of the documents and query to perform relevancy ranking on the documents. If too many documents are ranked very highly, the document clustering component 140 performs clustering on those highly ranked documents to determine which cluster or clusters most accurately represents the user's intent with the query.

[0177] In FIG. 14, a similar application is illustrated, except that no document clustering is required. Whether because of a constrained search space or because of a precisely worded user query, only one or a few documents in this instance sufficiently match the original user query to be presented to the user. Thus, no document clustering component is involved.

[0178] In FIG. 15, the user may request a scope-of-the-subject type query. In such a situation, the user wants to find out what general types of information are available within the current set of known documents. In this case, there is no need for relevancy ranking. Instead, only the document clustering component 140 is used to determine how the available documents 120 may pertain to the user query. The response to the user lists only the clusters of documents, possibly including one or a few sample documents from each cluster as examples of that cluster's contents. This provides the user with a sense of what kinds of information about a subject are available.

[0179] In FIG. 16, the relevancy ranking and clustering system and method are used as the basis for a question-answering system. In such an application the system first uses the document clustering component 140 to identify documents which have a high degree of relevancy to the user's question. In addition to identifying specific documents likely to contain the answer to the question, the document clustering component 140 identifies one or more windows within each document which are likely to contain the specific answer to the question.

[0180] Once those specific windows which contain answers to the user's query are identified, only those windows are passed to the relevancy ranking unit 128 for predicate structure-by-predicate structure comparison to the user query. Some adjustment of the weighting factors shown in FIG. 3 is needed to optimize the relevancy ranking unit to identify the specific predicate structures of the document window that are relevant to the user's question. This process further identifies the specific sentence or phrases within the document which contain the desired answer. A separate Answer Formulation Unit 170 then takes those sentences or phrases and uses them to formulate a natural language answer to the user's original query, which is then returned to the user.

[0181] In such a question-answering system, adjusting the dimensionality of the vectors produced by the vectorization unit controls the size of the document window as a system parameter. Using smaller windows improves the efficiency of the relevancy ranking unit 128 by reducing the number of individual predicate structures that must be dealt with. It also, however, forces more work on the document clustering component 140 because each document is represented by more vectors. The specific optimum size for the windows used is determined by experience in a particular user environment.

What is claimed is:
 1. A relevancy ranking method comprising the steps of: parsing an input query into at least one query predicate structure; parsing a set of documents to generate at least one document predicate structure; comparing each of said at least one query predicate structure with each of said at least one document predicate structure; calculating a matching degree using a multilevel modifier strategy to assign different relevance values to different parts of each of said at least one query predicate structure and said at least one document predicate structure match; and calculating a similarity coefficient based on pairs of said at least one query predicate structure and each of said at least one document predicate structure to determine relevance of each one of said set of documents to said input query.
 2. A relevancy ranking method as recited in claim 1, wherein said step of parsing an input query into at least one predicate structure is performed using an ontological parser.
 3. A relevancy ranking method as recited in claim 1, wherein said step of parsing a set of documents to generate at least one document predicate structure is performed using an ontological parser.
 4. A relevancy ranking method as recited in claim 1, wherein said matching degree is a real number.
 5. A relevancy ranking method as recited in claim 1, wherein said calculating step comprises the steps of: dynamically comparing overall predicate structures for each of said at least one document to said predicate structures for said at least one user query and returning a ranking based on a predicate vector similarity measure; comparing each of said at least one query predicate structure and said at least one document predicate structure and returning a predicate structure similarity measure; comparing similarity between predicate parts of said at least one query predicate structure and said at least one document predicate structure and returning a predicate matching similarity measure; comparing argument parts of said at least one query predicate structure and said at least one document predicate structure and returning an argument similarity measure; comparing concepts of said at least one query predicate structure and said at least one document predicate structure and returning a concept similarity measure; and comparing proper nouns of said at least one query predicate structure and said at least one document predicate structure and returning a proper noun similarity measure.
 6. A relevancy ranking method as recited in claim 1, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon an abstraction level of said at least one query predicate structure and said at least one document predicate structure, wherein said match is assigned a small weight when said match is relatively abstract.
 7. A relevancy ranking method as recited in claim 6, wherein said abstraction level of said at least one query predicate structure and said at least one document predicate structure comprises predicate only matches, argument only matches, and predicate and argument matches, wherein said predicate only matches are more abstract than said argument only matches, and said argument only matches are more abstract than said predicate and argument matches.
 8. A relevancy ranking method as recited in claim 1, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon concept proximity representing an ontological relationship between two concepts.
 9. A relevancy ranking method as recited in claim 8, wherein said ontological relationship between two concepts is closer when a difference between said two concepts is smaller, and said matching degree is assigned a higher relevancy bonus.
 10. A relevancy ranking method as recited in claim 1, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon the location of a predicate in one of said documents in said set of documents.
 11. A relevancy ranking method as recited in claim 10, wherein when said location is disposed in the beginning of said document, said document is assigned a higher relevancy number.
 12. A relevancy ranking method as recited in claim 1, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon a degree of proper noun matching.
 13. A relevancy ranking method as recited in claim 1, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon a matching degree of words having the same word stem.
 14. A relevancy ranking method as recited in claim 1, further comprising the step of identifying each of said document predicate structures by a predicate key that is an integer representation, wherein conceptual nearness of two of said document predicate structures is estimated by subtracting corresponding one of said predicate keys.
 15. A relevancy ranking method as recited in claim 14, comprising the further step of constructing multi-dimensional vectors using said integer representations.
 16. A relevancy ranking method as recited in claim 15, comprising the further step of normalizing said multi-dimensional vectors.
 17. A relevancy ranking method as recited in claim 1, further comprising the step of identifying each of said query predicate structures by a predicate key that is an integer representation, and constructing multi-dimensional vectors, for each of said query predicate structures, using said integer representations.
 18. A relevancy ranking method as recited in claim 16, further comprising the step of identifying each of said query predicate structures by a predicate key that is an integer representation, and constructing multi-dimensional vectors, for each of said query predicate structures, using said integer representations.
 19. A relevancy ranking method as recited in claim 18, further comprising the steps of performing a dot-product operation between multi-dimensional vectors, for each of said query predicate structures and each of said multi-dimensional vectors for each of said document predicate structures, ranking each of said documents in said document set from largest dot-product result to smallest dot-product result, and returning said rankings.
 20. A relevancy ranking method as recited in claim 1, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon a size of each of said documents in said set of documents.
 21. A clustering method comprising the steps of: parsing an input query into at least one query predicate structure; vectorizing said input query; identifying each of said query predicate structures by a predicate key that is an integer, and constructing multi-dimensional vectors, for each of said query predicate structures, using said integers; parsing a plurality of documents into at least one document predicate structure for each of said plurality of documents; vectorizing said set of documents; identifying said at least one document predicate structure by a predicate key that is an integer, wherein conceptual nearness of two of said document predicate structures is estimated by subtracting corresponding ones of said predicate keys; comparing said at least one query predicate structure with said plurality of document predicate structures for a said plurality of documents; clustering similar documents, within said plurality of documents, where said at least one document vector representation matches said at least one query predicate structure.
 22. A clustering method as recited in claim 21, wherein said clustering is performed based on patterns of predicate pairs of said matching ones of said set of documents.
 23. A clustering method as recited in claim 22, wherein said clustering step further comprises comparing said at least one predicate structure of said input query to a map of said clustered matches.
 24. A clustering method as recited in claim 23, wherein said clustering step further comprises identifying clusters most likely to fit said input.
 25. A clustering method as recited in claim 23, wherein said clustering step further comprises providing a feedback mechanism so that users can determine if a returned cluster is a good fit.
 26. A clustering method as recited in claim 23, wherein said clustering step comprises the steps of: self-organizing to adapt a collection of said set of documents matching an input query; and identifying and returning at least one appropriate cluster of said collection of documents.
 27. A clustering method as recited in claim 21, wherein said clustering is performed using a neural network, said clustering step performs said steps of: vectorizing said set of documents and vectorizing said input query; self-organizing said matching ones of said set of documents that match said input query; and retrieving clusters of said matching ones of said set of documents that match said input query.
 28. A clustering method as recited in claim 21, wherein said neural network comprises a plurality of neurodes.
 29. A clustering method as recited in claim 28, wherein said step of self-organizing said matching ones of said set of documents that match said input query comprises the steps of: developing a said map from said neurodes; and determining clusters of said plurality of neurodes that represent ones of said documents conceptually near one another.
 30. A relevancy ranking method as recited in claim 19, comprising the further step of clustering matching ones of said set of documents that match said input query.
 31. A relevancy ranking method as recited in claim 30, wherein said clustering is performed based on patterns of predicate pairs of said matching ones of said set of documents.
 32. A relevancy ranking method as recited in claim 31, wherein said clustering step further comprises comparing said at least one predicate structure of said input query to a map of said clustered matches.
 33. A relevancy ranking method as recited in claim 32, wherein said clustering step further comprises identifying clusters most likely to fit said input.
 34. A relevancy ranking method as recited in claim 32, wherein said clustering step further comprises providing a feedback mechanism so that users can determine if a returned cluster is a good fit.
 35. A relevancy ranking method as recited in claim 32, wherein said clustering step comprises the steps of: self-organizing to adapt a collection of said set of documents matching an input query; and identifying and returning at least one appropriate cluster of said collection of documents.
 36. A clustering method as recited in claim 21, further comprising the steps of: parsing an input query into said at least one predicate structure; vectorizing said input query; parsing said plurality of documents to generate at least one document predicate structure for each of said plurality of documents; vectorizing said plurality of documents; comparing each of said at least one query predicate structure with each of said at least one document predicate structure; calculating a matching degree using a multilevel modifier strategy to assign different relevance values to different parts of each of said at least one query predicate structure and said at least one document predicate structure match; and calculating a similarity coefficient based on pairs of said at least one query predicate structure and each of said at least one document predicate structure to determine relevance of each one of said set of documents to said input query.
 37. A clustering method as recited in claim 36, wherein said step of parsing an input query into at least one predicate structure is performed using an ontological parser.
 38. A clustering method as recited in claim 36, wherein said step of parsing a set of documents to generate at least one document predicate structure is performed using an ontological parser.
 39. A clustering method as recited in claim 36, wherein said matching degree is a real number.
 40. A clustering method as recited in claim 36, wherein said step of calculating a matching degree comprises the steps of: dynamically comparing overall predicate structures for each of said at least one document to said predicate structures for said at least one user query and returning a ranking based on a predicate vector similarity measure; comparing each of said at least one query predicate structure and said at least one document predicate structure and returning a predicate structure similarity measure; comparing similarity between predicate parts of said at least one query predicate structure and said at least one document predicate structure and returning a predicate matching similarity measure; comparing argument parts of said at least one query predicate structure and said at least one document predicate structure and returning an argument similarity measure; comparing concepts of said at least one query predicate structure and said at least one document predicate structure and returning a concept similarity measure; and comparing proper nouns of said at least one query predicate structure and said at least one document predicate structure and returning a proper noun similarity measure.
 41. A clustering method as recited in claim 36, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon an abstraction level of said at least one query predicate structure and said at least one document predicate structure, wherein said match is assigned a small weight when said match is relatively abstract.
 42. A clustering method as recited in claim 41, wherein said abstraction level of said at least one query predicate structure and said at least one document predicate structure comprises predicate only matches, argument only matches, and predicate and argument matches, wherein said verb only matches are more abstract than said noun only matches, and said noun only matches are more abstract than said verb and noun matches.
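The abstraction-level ordering recited in claims 41 and 42 can be sketched as a small weighting function. The claims fix only the ordering (predicate only matches weigh least, predicate and argument matches weigh most); the numeric weights, the tuple representation of a predicate structure, and the function names below are hypothetical choices made for illustration.

```python
# Hypothetical weights: the claims specify only their relative order.
ABSTRACTION_WEIGHTS = {
    "predicate_only": 0.2,          # most abstract, smallest weight
    "argument_only": 0.5,
    "predicate_and_argument": 1.0,  # least abstract, largest weight
}


def classify_match(query, document):
    """Classify a match between two (predicate, arguments) structures.

    Returns one of the keys of ABSTRACTION_WEIGHTS, or None when
    neither the predicate nor any argument matches.
    """
    query_predicate, query_arguments = query
    document_predicate, document_arguments = document
    predicate_matches = query_predicate == document_predicate
    argument_matches = bool(set(query_arguments) & set(document_arguments))
    if predicate_matches and argument_matches:
        return "predicate_and_argument"
    if argument_matches:
        return "argument_only"
    if predicate_matches:
        return "predicate_only"
    return None


def abstraction_weight(query, document):
    """Weight for the match; 0.0 when nothing matches."""
    return ABSTRACTION_WEIGHTS.get(classify_match(query, document), 0.0)


# Hypothetical query structure: predicate "buy" with two arguments.
query = ("buy", ("company", "stock"))
```

For example, `abstraction_weight(query, ("buy", ("house",)))` falls into the predicate only category and receives the smallest nonzero weight.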
 43. A clustering method as recited in claim 36, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon concept proximity representing an ontological relationship between two concepts.
 44. A clustering method as recited in claim 43, wherein said ontological relationship between two concepts is closer when a difference between said two concepts is smaller, and said matching degree is assigned a higher relevancy bonus.
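One way to picture the concept proximity bonus of claims 43 and 44 is a function that decreases as the difference between two concepts grows. The sketch below assumes that each concept is represented by an integer and that the bonus takes the form `max_bonus / (1 + difference)`; both the representation and the formula are illustrative assumptions, since the claims state only that a smaller difference yields a higher bonus.

```python
def proximity_bonus(concept_a, concept_b, max_bonus=1.0):
    """Relevancy bonus that shrinks as two concepts grow further apart.

    concept_a and concept_b are assumed integer concept identifiers;
    the reciprocal form is a hypothetical choice that preserves the
    claimed ordering (smaller difference, higher bonus).
    """
    difference = abs(concept_a - concept_b)
    return max_bonus / (1 + difference)
```

Identical concepts receive the full bonus, and the bonus falls monotonically as the difference increases.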
 45. A clustering method as recited in claim 36, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon the location of a predicate in one of said documents in said set of documents.
 46. A clustering method as recited in claim 45, wherein when said location is disposed in the beginning of said document, said document is assigned a higher relevancy number.
 47. A clustering method as recited in claim 36, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon a degree of proper noun matching.
 48. A clustering method as recited in claim 36, wherein said step of calculating said matching degree using a multilevel modifier strategy determines said relevance values based upon a matching degree of words having the same word stem.
 49. A clustering method as recited in claim 21, further comprising the step of identifying each of said document predicate structures by a predicate key that is an integer representation, wherein conceptual nearness of two of said document predicate structures is estimated by subtracting corresponding one of said predicate keys.
 50. A clustering method as recited in claim 49, comprising the further step of constructing multi-dimensional vectors using said integer representations.
 51. A clustering method as recited in claim 50, comprising the further step of normalizing said multi-dimensional vectors.
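The vectorization steps of claims 49 through 51 can be sketched as follows. The example key values, the use of an absolute difference as the nearness estimate, the zero padding to a fixed number of dimensions, and the choice of unit-length (Euclidean) normalization are all assumptions made for illustration; the claims require only integer predicate keys, subtraction of keys, construction of multi-dimensional vectors, and normalization.

```python
import math

# Hypothetical integer predicate keys; nearby integers are assumed to
# denote conceptually nearby predicate structures.
PREDICATE_KEYS = {"acquire": 1200, "purchase": 1203, "sleep": 8700}


def key_distance(key_a, key_b):
    """Estimate conceptual nearness by subtracting predicate keys.

    Taking the absolute value is an assumption so that the estimate
    does not depend on argument order.
    """
    return abs(key_a - key_b)


def to_vector(predicate_keys, dimensions):
    """Build a fixed-length vector from a document's predicate keys.

    Extra keys are truncated and missing positions are zero-padded
    (an assumed convention, not stated in the claims).
    """
    vector = list(predicate_keys[:dimensions])
    vector += [0] * (dimensions - len(vector))
    return vector


def normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors stay zero."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return [0.0] * len(vector)
    return [x / length for x in vector]
```

Under these assumed keys, `key_distance(PREDICATE_KEYS["acquire"], PREDICATE_KEYS["purchase"])` is much smaller than the distance from either key to `PREDICATE_KEYS["sleep"]`, which is the sense in which subtraction estimates conceptual nearness.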
 52. A clustering method as recited in claim 49, further comprising the step of identifying each of said query predicate structures by a predicate key that is an integer representation, and constructing multi-dimensional vectors, for each of said query predicate structures, using said integer representations.
 53. A clustering method as recited in claim 52, further comprising the step of identifying each of said query predicate structures by a predicate key that is an integer representation, and constructing multi-dimensional vectors, for each of said query predicate structures, using said integer representations.
 54. A method of vectorizing a set of document predicate structures, comprising the steps of: identifying each set of predicates and arguments in said set of predicate structures by predicate keys that are integer representations, wherein conceptual nearness of two of said document predicate structures is estimated by subtracting corresponding one of said predicate keys.
 55. A method of vectorizing a set of document predicate structures, as recited in claim 54, comprising the further step of constructing multi-dimensional vectors using said integer representations.
 56. A method of vectorizing a set of document predicate structures, as recited in claim 55, comprising the further step of normalizing said multi-dimensional vectors.
 57. A method of vectorizing a set of query predicate structures, as recited in claim 54, further comprising the step of identifying query predicate structures by predicate keys that are integer representations, and constructing multi-dimensional vectors, for each of said query predicate structures, using said integer representations.
 58. A method of vectorizing a set of query predicate structures, as recited in claim 56, further comprising the step of identifying query predicate structures by predicate keys that are integer representations, and constructing multi-dimensional vectors, for each of said query predicate structures, using said integer representations.
 59. A relevancy ranking system comprising: at least one ontological parser to parse an input query into at least one query predicate structure, and a set of documents each into at least one document predicate structure; an input query predicate storage unit that stores said at least one input query predicate structure; a document predicate storage unit that stores said at least one document predicate structure for each of said documents in said set; a query vectorization unit that converts said at least one query predicate structure into multidimensional numerical query vectors; a document vectorization unit that converts said at least one document predicate structures into multidimensional numerical document vectors; and a relevancy ranking unit that compares each of said at least one input query predicate structure with each of said at least one document predicate structure, calculates a matching degree to assign different relevance values to different parts of each of said at least one query predicate structure and said at least one document predicate structure match, and calculates a similarity coefficient based on pairs of said at least one query predicate structure and each of said at least one document predicate structure to determine relevance of each one of said set of documents to said input query.
 60. A relevancy ranking system as recited in claim 59, wherein said matching degree is a real number.
 61. A relevancy ranking system comprising: at least one ontological parser to parse an input query into at least one query predicate structure, and a set of documents each into at least one document predicate structure; an input query predicate storage unit that stores said at least one input query predicate structure; a document predicate storage unit that stores said at least one document predicate structure for each of said documents in said set; a document vectorization unit that converts said at least one document predicate structure into multidimensional numerical vectors; a query vectorization unit that converts said at least one query predicate structures into multidimensional numerical vectors; a relevancy ranking unit that compares each of said at least one input query predicate structure with each of said at least one document predicate structure, calculates a matching degree to assign different relevance values to different parts of each of said at least one query predicate structure and said at least one document predicate structure match, and calculates a similarity coefficient based on pairs of said at least one query predicate structure and each of said at least one document predicate structure to determine relevance of each one of said set of documents to said input query; and a neural network for providing clusters of matching ones of said set of documents that match said input query.
 62. A relevancy ranking system as recited in claim 61, further comprising a feedback mechanism so that users can determine if a returned cluster is a good match for said input query.
 63. A relevancy ranking system as recited in claim 61, wherein said neural network self-organizes and retrieves clusters of said matching ones of said set of documents that match said input query.
 64. A relevancy ranking system as recited in claim 61, wherein said neural network comprises a plurality of neurodes.
 65. A relevancy ranking system as recited in claim 59, further comprising a feedback mechanism so that users can determine if a returned cluster is a good match for said input query.
 66. A relevancy ranking system as recited in claim 59, wherein said neural network self-organizes and retrieves clusters of said matching ones of said set of documents that match said input query.
 67. A clustering system comprising: at least one ontological parser to parse an input query into at least one query predicate structure, and a set of documents each into at least one document predicate structure; an input query predicate storage unit that stores said at least one input query predicate structure; a document predicate storage unit that stores said at least one document predicate structure for each of said documents in said set; a document vectorization unit that converts said at least one document predicate structure into multidimensional numerical vector representations; a query vectorization unit that converts said at least one query predicate structure into multidimensional numerical vector representations; and a neural network for providing clusters of matching ones of said set of documents that match said input query.
 68. A question and answering system comprising: at least one ontological parser to parse an input query into at least one query predicate structure, and a set of documents each into at least one document predicate structure for each of a plurality of documents; a query vectorization unit that converts said at least one query predicate structure into multidimensional numerical vector representations, wherein each of said query predicate structures are identified by a predicate key that is an integer, and multi-dimensional vectors for each of said query predicate structures are constructed using said integers; a document vectorization unit that converts said at least one document predicate structure for each of a plurality of documents into multidimensional numerical vector representations, wherein said at least one document predicate structure is identified by a predicate key that is an integer, wherein conceptual nearness of two of said document predicate structures is estimated by subtracting corresponding ones of said predicate keys; clustering unit that groups similar documents, within said plurality of documents, where said at least one document vector representation matches said at least one query predicate structure; and a relevancy ranking unit that compares said at least one query predicate structure with said plurality of document predicate structures for each of said plurality of documents.
 69. A question and answering system as recited in claim 68, further comprising: an answer formulation unit that provides a natural language response to said input query.